Machine Learning Project Report
K-Means & DBSCAN Clustering Analysis
Aspiring SDE, AIML Intern
B.Tech (Information Technology)
Final Semester - 8.0/10 CGPA
Muffakham Jah College of Engineering & Technology
Focused on: AI/ML • Software Engineering • Machine Learning
Keen on Artificial Intelligence & Machine Learning
Focus on building end-to-end solutions that combine ML with software engineering best practices
In today's competitive financial landscape, credit card companies face the challenge of understanding diverse customer behaviors and preferences. With millions of cardholders exhibiting varying spending patterns, payment behaviors, and financial needs, a one-size-fits-all approach to customer service and marketing is no longer effective.
This project addresses these challenges by applying unsupervised machine learning techniques (K-Means and DBSCAN) to segment credit card customers into meaningful groups, enabling data-driven decision-making for personalized customer experiences and improved business outcomes.
import pandas as pd
import numpy as np
# Load the credit card dataset
df = pd.read_csv('CC_GENERAL.csv')
# Display basic information
print(f"Dataset shape: {df.shape}")
print(f"Features: {df.columns.tolist()}")
print(f"Missing values: {df.isnull().sum().sum()}")
# Display first few rows
df.head()
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])
# Handle missing values
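# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(df['MINIMUM_PAYMENTS'].median())
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(df['CREDIT_LIMIT'].median())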
# Outlier detection using IQR method
def detect_outliers(df, feature):
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[feature] < lower_bound) | (df[feature] > upper_bound)]
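The IQR rule can be tried out on a small synthetic column. The sketch below is illustrative only: the toy values are made up, and `iqr_bounds` and `cap_outliers` are hypothetical helpers (clipping extreme values instead of dropping rows is one possible treatment, not necessarily the one used in this project).

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(df, feature, k=1.5):
    """Hypothetical helper: clip values outside the IQR fences instead of dropping rows."""
    lower, upper = iqr_bounds(df[feature], k)
    capped = df.copy()
    capped[feature] = capped[feature].clip(lower=lower, upper=upper)
    return capped


# Toy data (made up for illustration): one extreme balance among typical ones
toy = pd.DataFrame({"BALANCE": [100.0, 120.0, 130.0, 140.0, 150.0, 5000.0]})
lower, upper = iqr_bounds(toy["BALANCE"])
outliers = toy[(toy["BALANCE"] < lower) | (toy["BALANCE"] > upper)]
capped = cap_outliers(toy, "BALANCE")
print(f"Fences: [{lower:.1f}, {upper:.1f}], outlier rows: {len(outliers)}")
```

Capping keeps every customer in the dataset, which matters when each row must still receive a cluster label; dropping rows is the simpler alternative when that is not required.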
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Feature engineering
df['PURCHASE_TO_LIMIT_RATIO'] = df['PURCHASES'] / df['CREDIT_LIMIT']
df['CASH_ADVANCE_TO_LIMIT_RATIO'] = df['CASH_ADVANCE'] / df['CREDIT_LIMIT']
df['PAYMENT_TO_BALANCE_RATIO'] = df['PAYMENTS'] / (df['BALANCE'] + 1)
# Remove CUST_ID for clustering
features = df.drop(['CUST_ID'], axis=1)
# Standardize features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Apply PCA for dimensionality reduction
pca = PCA(n_components=0.95) # Retain 95% variance
pca_features = pca.fit_transform(scaled_features)
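To see what `n_components=0.95` does in practice, the fitted PCA object can be inspected. The following is a minimal sketch on synthetic data (the random matrix is an assumption standing in for the scaled credit card features); it only shows how to read off the number of retained components and the cumulative explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: 200 rows, 6 columns,
# where the last three columns are noisy copies of the first three)
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.05 * rng.normal(size=(200, 3))])

X_scaled = StandardScaler().fit_transform(X)

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"Components kept: {pca.n_components_} of {X.shape[1]}")
print(f"Cumulative explained variance: {cumulative.round(3)}")
```

Because the synthetic columns are strongly correlated, fewer components than input columns are retained; on real data the reduction depends on how redundant the features are.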
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
# Elbow method for optimal k
def elbow_method(data, max_k=10):
    sse = []
    for k in range(1, max_k + 1):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(data)
        sse.append(kmeans.inertia_)
    return sse
# K-Means clustering
kmeans = KMeans(n_clusters=4, random_state=42, init='k-means++')
kmeans_labels = kmeans.fit_predict(scaled_features)
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(scaled_features)
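# Evaluate clustering performance
kmeans_silhouette = silhouette_score(scaled_features, kmeans_labels)
print(f"K-Means Silhouette Score: {kmeans_silhouette:.3f}")
# DBSCAN marks noise points with the label -1; score only clustered points,
# and only when at least two clusters exist (the score is undefined otherwise)
clustered = dbscan_labels != -1
if len(set(dbscan_labels[clustered])) >= 2:
    dbscan_silhouette = silhouette_score(scaled_features[clustered], dbscan_labels[clustered])
    print(f"DBSCAN Silhouette Score: {dbscan_silhouette:.3f}")
else:
    print("DBSCAN found fewer than two clusters; silhouette score is undefined")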
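The listing fixes `n_clusters=4` and `eps=0.5` without showing how those values might be chosen. As a hedged sketch on synthetic blobs (the data, the helper names `silhouette_by_k` and `k_distances`, and the candidate range of k are all assumptions for illustration), one common approach compares silhouette scores across k and reads a DBSCAN `eps` candidate from the sorted k-distance curve.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors


def silhouette_by_k(data, k_values, random_state=42):
    """Fit K-Means for each candidate k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(data)
        scores[k] = silhouette_score(data, labels)
    return scores


def k_distances(data, min_samples=5):
    """Sorted distance from each point to its min_samples-th nearest neighbor.

    The point itself counts as its own first neighbor here, which matches how
    DBSCAN's min_samples counts the point being tested.
    """
    nn = NearestNeighbors(n_neighbors=min_samples).fit(data)
    distances, _ = nn.kneighbors(data)
    return np.sort(distances[:, -1])


# Synthetic stand-in for the scaled feature matrix
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

scores = silhouette_by_k(X, range(2, 7))
best_k = max(scores, key=scores.get)
print(f"Silhouette by k: { {k: round(float(s), 3) for k, s in scores.items()} }")
print(f"Highest silhouette at k={best_k}")

# A sharp bend ("knee") in this curve is a common heuristic for choosing eps
kd = k_distances(X, min_samples=5)
print(f"k-distance quantiles (50%, 90%, 99%): {np.quantile(kd, [0.5, 0.9, 0.99]).round(3)}")
```

Silhouette scores start at k=2 because the metric is undefined for a single cluster, so it complements rather than replaces the inertia-based elbow curve. The k-distance knee is only a starting point; in high-dimensional standardized data a workable `eps` is often larger than the default 0.5.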
[Figures: Silhouette Score (two panels); K-Means Clusters]
/predict - Customer segmentation
/batch_predict - Bulk processing
/insights - Business analytics
/health - System monitoring
from flask import Flask, request, jsonify, render_template
import pickle
import pandas as pd
import numpy as np
app = Flask(__name__)
# Load trained models
with open('models/kmeans_model.pkl', 'rb') as f:
    kmeans_model = pickle.load(f)
with open('models/scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        # Get input data
        data = request.get_json()
        # Preprocess data
        features = pd.DataFrame([data])
        scaled_features = scaler.transform(features)
        # Make prediction
        cluster = kmeans_model.predict(scaled_features)[0]
        # Get cluster insights (the get_cluster_insights helper is not shown in this report)
        insights = get_cluster_insights(cluster)
        return jsonify({
            'cluster': int(cluster),
            'segment_name': insights['name'],
            'characteristics': insights['characteristics'],
            'recommendations': insights['recommendations']
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 400

if __name__ == '__main__':
    app.run(debug=True)
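The `/predict` handler calls `get_cluster_insights`, which the report does not define. A minimal sketch of what such a helper could look like follows; it is hypothetical, and the segment names and texts are deliberate placeholders rather than findings from this project. Only the three keys the handler reads (`name`, `characteristics`, `recommendations`) are taken from the listing.

```python
# Hypothetical lookup table: names and texts are placeholders, not results
# from the report. Replace them with profiles derived from the fitted clusters.
CLUSTER_INSIGHTS = {
    cluster_id: {
        "name": f"Segment {cluster_id}",
        "characteristics": [f"<describe cluster {cluster_id}>"],
        "recommendations": [f"<suggested action for cluster {cluster_id}>"],
    }
    for cluster_id in range(4)  # four clusters, matching n_clusters=4
}

# Returned when a cluster id has no entry in the table
UNKNOWN_SEGMENT = {
    "name": "Unknown segment",
    "characteristics": [],
    "recommendations": [],
}


def get_cluster_insights(cluster):
    """Return the insight record for a cluster id, or a generic fallback."""
    cluster_id = int(cluster)  # also accepts NumPy integers returned by predict()
    return CLUSTER_INSIGHTS.get(cluster_id, UNKNOWN_SEGMENT)


example = get_cluster_insights(2)
print(example["name"], example["characteristics"], example["recommendations"])
```

Keeping the descriptions in a plain dictionary means they can be updated after retraining without touching the route code; the fallback record avoids a `KeyError` if the model is later refit with a different number of clusters.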
This customer segmentation project has successfully transformed raw credit card transaction data into actionable business insights. The implementation of K-Means and DBSCAN clustering algorithms has enabled the identification of four distinct customer segments, each with unique characteristics and behaviors. The deployed Flask web application provides real-time segmentation capabilities, allowing business stakeholders to make data-driven decisions for targeted marketing campaigns and personalized customer experiences.